Abstract. Relations between the off thermal equilibrium dynamical process
of on-line learning and the thermally equilibrated off-line learning are studied
for potential gradient descent learning. The approach of Opper to study on-line
Bayesian algorithms is extended to potential based or maximum likelihood
learning. We look at the on-line learning algorithm that best approximates the
off-line algorithm in the sense of least Kullback-Leibler information loss. It works
by updating the weights along the gradient of an effective potential different from
the parent off-line potential. The interpretation of this off equilibrium dynamics
holds some similarities to the cavity approach of Griniasty. We are able to
analyze networks with non-smooth transfer functions and transfer the smoothness
requirement to the potential.



PACS numbers: 84.35.+i, 89.70.+c, 05.50.+q

The application of Statistical Mechanics to the study of learning in Neural 
Networks (NN) stems from the fact that the extraction of information from data 
(examples) can be modeled by a dynamical process of minimization of an energy 
function, possibly in the presence of (thermal) noise. In the case where the system is 
allowed to equilibrate, roughly all the possible information has been extracted from 
the data by the learning algorithm. In a very important sense learning theory is 
different from e.g. magnetism. In the latter the interactions are fixed by the physical 
constraints, and the equilibrium state and how it is reached is the object of study. In 
the former, the energy function can be chosen in order to achieve a certain property in 
the equilibrium state, such as largest possible typical generalization or memorization 
capability. 

Techniques originating in the study of disordered systems, such as the replica
and cavity methods, TAP equations, as well as Monte Carlo techniques, have been
borrowed and extended, leading to several results in what has become known as
Off-line learning (OfL). Since disordered systems may take too long to equilibrate,
implying a high computational cost, the search for efficient nonequilibrium learning
algorithms has been undertaken. An interesting class of methods, where essentially
examples are used one at a time, is collected under the name of On-line learning
(OnL). These bring the possibility of efficient performance and low computational
cost.

Recently Opper [16] offered a new theoretical way of studying the relation
between OfL and OnL. He applied his ideas to Bayes learning. The posterior
probability distribution for the set of weights obtained after T examples is used as
the prior for the next example. If the full posterior is maintained, any calculation
amounts to an OfL one. But by projecting the posterior into a restricted family of
parametric distributions, huge computational gains can be achieved, transforming the
process into an effective OnL one. Now only a set of parameters and an auxiliary
set of hyperparameters have to be updated. The changes in the hyperparameters
automatically induce an effective annealing of tensorial learning rates. In the case
of continuous weights, he applied these ideas by projecting to a gaussian space of
posteriors. Solla and Winther extended the method so that information
about e.g. the binary nature of weights can be included in a consistent way. This is
simply achieved by projecting into another family of posteriors and again imposing
that the information loss be minimized.

There is however no reason to limit these studies to the case of Bayes learning 
and the aim of this paper is to extend Opper's method to include the problem of 
learning by gradient descent. We obtain equations that describe the evolution of the 
weights and hyperparameters for general differentiable potentials. Then we look at 
some applications. We analyze the relation between the off-equilibrium and thermal 
equilibrium for a special case which is Bayes optimal with a nondifferentiable transfer
function, the noiseless Boolean perceptron, a case which cannot be treated by Opper's 
Bayesian analysis. The on-line algorithm is automatically annealed and we discuss how 
the annealing is related to a performance estimate. Finally, we apply the resulting 
equations to the same architecture but for a nonsmooth potential in order to study 
the resulting algorithm. 

Let $y_k$ be an example. In the case of supervised learning it is to be thought
of as an input-output pair $y_k = (S_k, \sigma_k)$ and we assume that the data pairs are
generated by a map $\sigma = f_{w^*}(S)$ which might be deterministic or stochastic so as
to include the possibility of noise corrupted data. For unsupervised learning or
density estimation it is an input vector $y_k = S_k$. The learning set is formed by $\mu$
such random examples $D_\mu = (y_1, y_2, \ldots, y_\mu)$, drawn independently from identical
distributions. The purpose of learning is to make an estimate $w$ of the true $N$
dimensional vector of parameters or weights $w^*$. To do so a cost function or potential
$V(\sigma, f_w(S)) = V(w, y)$ is introduced. Usually one seeks a minimum of the total energy
$E(w) = \sum_{k=1}^{\mu} V(\sigma_k, f_w(S_k))$, so that learning is stated as an optimization problem.
The additive form is adequate in the case of independent (or non interacting) examples.
There is also the possibility that aside from the learning set, other information about
the possible weight vectors is available. It might be encoded in the prior probability
$P_0(w)$, that is, the probability that can be attributed to any $w$ of being the true
parameter vector, based on information other than $D_\mu$. The information contained
in the prior and in the learning set can be taken into account simultaneously by using
Bayes theorem and imposing the equivalence of the minimum energy prescription and
that of maximizing the likelihood of the examples, which as shown by Levin et al.
leads to a functional equation whose solution is the Gibbs distribution:

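$$ P_V(w|D_\mu) = \frac{1}{Z} P_0(w) P(D_\mu|w) \tag{1} $$

$$ \phantom{P_V(w|D_\mu)} = \frac{1}{Z} P_0(w) \exp\Big(-\beta \sum_{k=1}^{\mu} V(w, y_k)\Big), \tag{2} $$

where $\beta$ measures the sensitivity of the likelihood and of course plays the
role of the inverse temperature, and the partition function is given by
$Z = \int P_0(w') P(D_\mu|w')\, d^N w'$.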

The problem has thus been formulated as one of Statistical Mechanics of
disordered systems due to the random nature of the data. Spin glass behavior for this
type of system has been found in many different cases. Estimation of parameters may
turn into a computationally hard problem, as suggested by the long thermalization times
encountered while doing Monte Carlo estimates. This also happens for the prediction
of the output $\sigma$ to a new (statistically independent) input vector. A neural network,
on the other hand, once it has been trained and a reasonable $w$ has been determined,
permits rapid estimation of $\sigma$. The fact that the determination of $w$, using the full
Gibbs distribution, may itself be hard, seems to imply that there is no way out.
However, suppose a reasonable estimate has been achieved for a learning set $D_\mu$; then
the incorporation of the information carried by a new example $y_{\mu+1}$ can be efficiently
and easily done, at least in an approximate way. This is the idea behind OnL and we
now study this from the same perspective Opper has used to analyze Bayes learning.
That these estimates are in general hard to do leads to an approximation of the Gibbs
distribution $P_V(w|D_\mu)$ by $P_G(w|D_\mu)$. The type of problem dictates what is a useful
approximation. In many cases the fluctuations, at least for large $\mu$, will be gaussian and
so we study this case. Still the approximation can be done in many ways. To limit the
loss of hard gained information, as measured by the Kullback-Leibler divergence,
we follow Opper and project the current version of the Gibbs distribution to a
gaussian with the same mean $w(\mu)$ and covariance $C_{ij}(\mu)$.

OnL proceeds by storing all the information in the previous $\mu$ examples in the
vector $w(\mu)$. Other auxiliary quantities (in this case the covariance $C_{ij}(\mu)$), usually
termed hyperparameters, will be needed, and their natural appearance and evolution
justify the annealing of learning rates.

The basic idea is to consider the Gibbs distribution as the prior for the new, the
$(\mu+1)$th, example. Even when $P_V(w|D_\mu)$ is substituted by the gaussian $P_G(w|D_\mu)$,
in general $P_V(w|D_{\mu+1})$ will not be gaussian. Therefore it is projected into a gaussian
of mean $w(\mu+1)$ and covariance $C_{ij}(\mu+1)$. The procedure can then be iterated to
include the next example. Of course this update will change the covariance of the
posterior, leading to a new set of equations relating $C_{ij}(\mu+1)$ and $C_{ij}(\mu)$.

The introduction of a new example, if the system is allowed to thermalize, can
be the starting point for a cavity analysis as studied by Griniasty. We do not, by
doing the gaussian approximation, allow the system to thermalize.

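In order to calculate the approximate change in the expected value of $w$, start
with

$$ P_V(w|D_{\mu+1}) \propto P_V(w|D_\mu)\, e^{-\beta V(w, y_{\mu+1})} \tag{3} $$

and substitute it by

$$ P_V(w|D_{\mu+1}) \propto P_G(w|D_\mu)\, e^{-\beta V(w, y_{\mu+1})}, \tag{4} $$

then project $P_V(w|D_{\mu+1})$ to $P_G(w|D_{\mu+1})$. Call the initial conditions to this iteration
procedure $w(0)$ for the mean and $C(0)$ for the covariance. We call our current estimates
of the weights and the covariance $w(\mu)$ and $C(\mu)$ respectively. Then

$$ w_i(\mu+1) = \int w_i\, P_G(w|D_{\mu+1})\, d^N w, \tag{5} $$

$$ C_{ij}(\mu+1) = \int \big(w_i - w_i(\mu+1)\big)\big(w_j - w_j(\mu+1)\big) P_G(w|D_{\mu+1})\, d^N w. \tag{6} $$

Let $u$ measure the gaussian fluctuations of $w$ around $w(\mu)$:

$$ w_i(\mu+1) = \frac{\int \big(w_i(\mu) + u_i\big)\, e^{-\frac{1}{2} u^T C^{-1} u}\, e^{-\beta V(w(\mu)+u,\, y_{\mu+1})}\, d^N u}{\int e^{-\frac{1}{2} u^T C^{-1} u}\, e^{-\beta V(w(\mu)+u,\, y_{\mu+1})}\, d^N u}. $$

Note that $u_i\, e^{-\frac{1}{2} u^T C^{-1} u} = -C_{ij}\, \partial_{u_j} e^{-\frac{1}{2} u^T C^{-1} u}$; then one integration by parts leads
to

$$ w_i(\mu+1) = w_i(\mu) + C_{ij}\, \frac{\int e^{-\frac{1}{2} u^T C^{-1} u}\, \partial_{u_j} e^{-\beta V(w(\mu)+u,\, y_{\mu+1})}\, d^N u}{\int e^{-\frac{1}{2} u^T C^{-1} u}\, e^{-\beta V(w(\mu)+u,\, y_{\mu+1})}\, d^N u}, $$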

where a summation over repeated indices is implied. Note the very important
assumption that the potential is differentiable. This prevents the application to some
popular non differentiable potential based algorithms. However we can deal with
networks with a nonsmooth transfer function. Then using

$$ \partial_{u_j} f(w+u) = \partial_{w_j} f(w+u), \tag{7} $$

the on-line algorithm that results is

$$ w_i(\mu+1) = w_i(\mu) + C_{ij}(\mu)\, \partial_{w_j} \ln \langle e^{-\beta V(w(\mu)+u)} \rangle, \tag{8} $$

where $\langle \cdots \rangle$ means the average with respect to the gaussian distribution with zero
mean and covariance $C_{ij}(\mu)$.

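The next step is to determine the evolution of the covariance. In terms of
the gaussian distributed fluctuations $u_i$ of zero mean and the variation
$\Delta w_i = w_i(\mu+1) - w_i(\mu)$, given by equation (8),

$$ C_{ij}(\mu+1) = \int (u_i - \Delta w_i)(u_j - \Delta w_j)\, P_G(w|D_{\mu+1})\, d^N w. \tag{9} $$

Now use the identity

$$ u_i u_j\, e^{-\frac{1}{2} u^T C^{-1} u} = C_{ij}\, e^{-\frac{1}{2} u^T C^{-1} u} + C_{ik} C_{jl}\, \partial_{u_k} \partial_{u_l} \big( e^{-\frac{1}{2} u^T C^{-1} u} \big), \tag{10} $$

then two integrations by parts and the use of eq. (8) determine the prescription for
the covariance update:

$$ C_{ij}(\mu+1) = C_{ij}(\mu) + C_{ik}(\mu) C_{jl}(\mu)\, \partial_{w_k} \partial_{w_l} \ln \langle e^{-\beta V(w(\mu)+u)} \rangle. \tag{11} $$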
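
The mean and covariance updates can be checked numerically in one dimension. The sketch below is only an illustration under stated assumptions: the potential `V` is a hypothetical smooth choice (not taken from the text), the gaussian average is done by a plain sum on a grid, and the derivatives of $\ln\langle e^{-\beta V}\rangle$ are taken by central differences. It compares the update written as "old value plus $C$ times a log-derivative" (first and second derivative for mean and variance, respectively) with the mean and variance of the tilted density $e^{-u^2/2C}\,e^{-\beta V(w+u)}$ computed directly.

```python
import math

def V(lam):
    """Hypothetical smooth potential: close to 1 for very negative argument, close to 0 for very positive."""
    return 1.0 / (1.0 + math.exp(lam))

def grid(C, half_width=10.0, n=4001):
    """Equally spaced fluctuations u covering +/- half_width standard deviations."""
    s = math.sqrt(C)
    step = 2.0 * half_width * s / (n - 1)
    return [-half_width * s + k * step for k in range(n)]

def log_avg(w, C, beta):
    """ln <exp(-beta V(w+u))> up to an additive constant that does not depend on w."""
    return math.log(sum(math.exp(-0.5 * u * u / C - beta * V(w + u)) for u in grid(C)))

def projected_update(w, C, beta, h=1e-3):
    """One-dimensional form of the updates: w + C d/dw ln<...> and C + C^2 d^2/dw^2 ln<...>."""
    f0, fp, fm = log_avg(w, C, beta), log_avg(w + h, C, beta), log_avg(w - h, C, beta)
    d1 = (fp - fm) / (2.0 * h)           # first derivative by central difference
    d2 = (fp - 2.0 * f0 + fm) / (h * h)  # second derivative by central difference
    return w + C * d1, C + C * C * d2

def direct_moments(w, C, beta):
    """Mean and variance of the tilted density, computed directly on the same grid."""
    us = grid(C)
    ps = [math.exp(-0.5 * u * u / C - beta * V(w + u)) for u in us]
    z = sum(ps)
    mean_u = sum(u * p for u, p in zip(us, ps)) / z
    var = sum((u - mean_u) ** 2 * p for u, p in zip(us, ps)) / z
    return w + mean_u, var

if __name__ == "__main__":
    print(projected_update(0.3, 0.5, 2.0))
    print(direct_moments(0.3, 0.5, 2.0))
```

The two printed pairs should agree up to the finite-difference error, which is what the two integrations by parts assert.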

On one hand, this set of equations describes a first (gaussian) approximation to the
problem of OfL learning with the potential $E = \sum_\mu V(w, y_\mu)$. On the other hand
it describes an OnL learning prescription for the update of the weight vector, and a
set of hyperparameters which are useful in improving performance.

We now consider the widely popular class of problems where the network is
a classifier into two categories $\sigma = \pm 1$ and the dimension of $S$ is $N$. We study
the case where the potential $V(\lambda)$ is a differentiable function of the stability
$\lambda = \sigma w \cdot S/\sqrt{N}$. How is the resulting algorithm related to the usual OnL schemes? Let
$t = \sigma w(\mu) \cdot S/\sqrt{N}$ denote the stability of an example previous to its presentation
to the network, so that $\lambda = t + \sigma u \cdot S/\sqrt{N}$ is the stability of example $S$ in the
network parametrized by $w$. Introduce $\hat{C}_{ij}(\mu) = \beta C_{ij}(\mu)$ and $x = S_i \hat{C}_{ij} S_j/N$.
An explicit form for $\langle \exp(-\beta V) \rangle$ can be obtained. Introduce a 1 in the form
$1 = \int d\lambda\, \delta(\lambda - \sigma w \cdot S/\sqrt{N}) \propto \int d\lambda\, d\hat{\lambda}\, \exp\big[ i\hat{\lambda} (\lambda - \sigma w \cdot S/\sqrt{N}) \big]$. A pair of quadratic
integrations shows that

$$ \langle \exp(-\beta V) \rangle \propto \int d\lambda\, \exp -\beta \Big[ V(\lambda) + \frac{(\lambda - t)^2}{2x} \Big], \tag{12} $$
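
The gaussian integrations can also be bypassed by a short argument, given here as a worked step with $\hat{C}_{ij} = \beta C_{ij}$ and $x = S_i \hat{C}_{ij} S_j / N$ as in the text: since $u$ is gaussian with zero mean and covariance $C_{ij}$, the stability $\lambda = t + \sigma u\cdot S/\sqrt{N}$ is itself gaussian, and only its mean and variance are needed.

```latex
\langle \lambda \rangle = t, \qquad
\langle (\lambda - t)^2 \rangle = \frac{\sigma^2}{N} S_i \langle u_i u_j \rangle S_j
  = \frac{S_i C_{ij} S_j}{N} = \frac{x}{\beta},
\qquad\Longrightarrow\qquad
\langle e^{-\beta V} \rangle
  = \sqrt{\frac{\beta}{2\pi x}} \int d\lambda\,
    \exp -\beta\Big[ V(\lambda) + \frac{(\lambda - t)^2}{2x} \Big].
```

The prefactor does not depend on $t$, so it drops out of the derivatives of the logarithm with respect to $t$ that enter the update equations.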
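thus for the estimate of the weights we have:

$$ w_i(\mu+1) = w_i(\mu) + \frac{\sigma}{\beta \sqrt{N}}\, \hat{C}_{ij}(\mu) S_j\, \partial_t \ln \int d\lambda\, \exp -\beta \Big[ V(\lambda) + \frac{(\lambda - t)^2}{2x} \Big], \tag{13} $$

while for the annealing equation

$$ \hat{C}_{ij}(\mu+1) = \hat{C}_{ij}(\mu) + \frac{1}{\beta N}\, \hat{C}_{ik}(\mu) \hat{C}_{jl}(\mu) S_k S_l\, \partial_t^2 \ln \int d\lambda\, \exp -\beta \Big[ V(\lambda) + \frac{(\lambda - t)^2}{2x} \Big]. \tag{14} $$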

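To compare to previous work we look at the zero temperature limit. The $\lambda$ integral
can be calculated by the saddle point method. Let $\lambda_0(t)$ be the minimum of
$V(\lambda) + (\lambda - t)^2/2x$, that is, the solution of

$$ \Big[ \frac{\partial V}{\partial \lambda} + \frac{\lambda - t}{x} \Big]_{\lambda = \lambda_0} = 0, \tag{15} $$

then $\ln \langle \exp(-\beta V) \rangle = -\beta \big( V(\lambda_0) + (\lambda_0 - t)^2/2x \big)$. Define what we will show to be the
effective on-line potential

$$ \mathcal{E}_x(t) = V(\lambda_0) + \frac{(\lambda_0 - t)^2}{2x}. \tag{16} $$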



Note that from eqs. (15) and (16) it is easy to see that

$$ \frac{\partial V}{\partial \lambda} \Big|_{\lambda_0} = \frac{\partial \mathcal{E}_x(t)}{\partial t}. \tag{17} $$
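
The step can be spelled out as follows, writing $\mathcal{E}_x(t) = V(\lambda_0) + (\lambda_0 - t)^2/2x$ for the effective potential and using the fact that $x$ does not depend on $t$. The total derivative with respect to $t$ has a part that acts through $\lambda_0(t)$, which vanishes by the saddle point condition, and an explicit part:

```latex
\frac{d\mathcal{E}_x}{dt}
  = \Big[ \frac{\partial V}{\partial\lambda} + \frac{\lambda - t}{x} \Big]_{\lambda_0}
    \frac{d\lambda_0}{dt}
    \;-\; \frac{\lambda_0 - t}{x}
  = -\frac{\lambda_0 - t}{x}
  = \frac{\partial V}{\partial\lambda}\Big|_{\lambda_0} .
```

The last equality is the saddle point condition used once more.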



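The algorithm equations can now be written as

$$ w_i(\mu+1) = w_i(\mu) - \frac{1}{\sqrt{N}}\, \frac{\partial \mathcal{E}_x(t)}{\partial t}\, \hat{C}_{ij}(\mu) S_j \sigma(\mu), \tag{18} $$

$$ \hat{C}_{ij}(\mu+1) = \hat{C}_{ij}(\mu) - \frac{1}{N}\, \hat{C}_{ik}(\mu) \hat{C}_{jl}(\mu) S_k S_l\, \frac{\partial^2 \mathcal{E}_x(t)}{\partial t^2}. \tag{19} $$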
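
A minimal sketch of one zero-temperature update step is given below, under assumptions that go beyond the text. The function names are hypothetical. The posterior stability $\lambda_0$ is found by bisection on the saddle point condition, which assumes its left-hand side is increasing in $\lambda$ (true for convex $V$). The first derivative of the effective potential is taken as $V'(\lambda_0)$, and the second derivative is obtained by differentiating the saddle point condition once more, which gives $V''(\lambda_0)/(1 + x V''(\lambda_0))$; this last formula is a derived step, not quoted from the text. The quadratic potential in the usage lines is chosen only because everything can then be checked by hand.

```python
import math

def posterior_stability(dV, t, x, lo=-1e3, hi=1e3, iters=200):
    """Solve dV(lam) + (lam - t)/x = 0 by bisection.
    Assumes the left-hand side is increasing in lam and changes sign in [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dV(mid) + (mid - t) / x > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def online_step(w, C_hat, S, sigma, dV, d2V):
    """One update of the weights w and of the (symmetric) matrix C_hat for the example (S, sigma).
    dV and d2V are the first and second derivatives of the off-line potential V(lam)."""
    N = len(w)
    CS = [sum(C_hat[i][j] * S[j] for j in range(N)) for i in range(N)]  # C_hat S
    x = sum(S[i] * CS[i] for i in range(N)) / N                         # S C_hat S / N
    t = sigma * sum(w[i] * S[i] for i in range(N)) / math.sqrt(N)       # prior stability
    lam0 = posterior_stability(dV, t, x)
    dE = dV(lam0)                                # gradient of the effective potential at t
    d2E = d2V(lam0) / (1.0 + x * d2V(lam0))      # derived second derivative (assumption, see lead-in)
    w_new = [w[i] - sigma * CS[i] * dE / math.sqrt(N) for i in range(N)]
    C_new = [[C_hat[i][j] - CS[i] * CS[j] * d2E / N for j in range(N)] for i in range(N)]
    return w_new, C_new

if __name__ == "__main__":
    kappa = 1.0                                  # hypothetical quadratic potential V = (lam - kappa)^2 / 2
    w, C = online_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], 1.0,
                       lambda lam: lam - kappa, lambda lam: 1.0)
    print(w, C)
```

For this quadratic choice the saddle point is $\lambda_0 = (x\kappa + t)/(1+x)$, and the stability of the example after the step coincides with $\lambda_0$, which illustrates the sense in which the effective potential anticipates the post-presentation stability.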



The update of $w$ (eq. (18)) can be identified with an annealed (time or number of
examples $\mu$ dependent $\hat{C}_{ij}$) tensorial learning rate Hebbian-like algorithm modulated
by $\partial V/\partial\lambda|_{\lambda_0}$, the gradient of the original potential calculated, not at the point $t$ where it
would be expected since it is the pre-training stability, but at the posterior stability
$\lambda_0$. However, the need to calculate the gradient at a future point $\lambda_0$ would render
this algorithm useless. But in its stead (see eq. (17)) the gradient of a related
potential is used. The OfL potential is transmuted to the effective OnL potential, and
the gradient of the latter can be calculated at the accessible value of $t$.

Equation (15) is reminiscent of others that have appeared in related but different places
and a few comments are in order. It is not totally unrelated to those obtained in
the cavity analysis of learning by Griniasty. The cavity and replica methods
are not constructive; they are used to determine the OfL performance of gradient
descent learning algorithms. The parameter $x$ plays the role of the stiffness parameter
in the cavity analysis and that of $x = \lim_{\beta \to \infty} \beta (1 - q)$ in the replica (symmetric)
calculations. With respect to the latter, Bouten et al. have, in their analysis of OfL
gradient descent learning, stressed the interpretation of replica results in terms of
cavity arguments.

But this effect of transmutation of potentials has been seen before, in works
done in the context of the variational-optimization method. Its purpose
is to determine a potential that leads to maximum performance by functionally
extremizing a performance measure such as the generalization error with respect to the
potential. For some architectures it has been applied to both OnL and OfL learning
in the thermodynamic limit in order to determine maximum possible generalization.
It was found that for the single layer perceptron, equation (16) gives precisely
the relation between the optimal generalization OnL and OfL potentials. The same
relation holds in unsupervised learning. Up to now this relation (eq. (16)) seemed
little more than accidental, but now it can be seen as a consequence of approximating
OfL by the closest (in the sense of Kullback-Leibler divergence) OnL learning scheme.

Equation (19) describes the annealing of the tensorial learning rate. Several works
have stressed the need for an OnL learning rate annealing. The need
comes from the fact that once an estimate is close to a minimum of the potential, the
step size should be reduced in order not to overshoot. The analogue of an annealing
rate in an OfL problem appears, e.g., where performance is improved by
choosing a parameter of the potential (there, the threshold $\kappa$ of a relaxation algorithm)
from the knowledge of the size of the learning set. This appears automatically in the
variational optimized potentials, both OnL and OfL. The origin of the need for
annealing was thought to be the same. However, here, as in the work of Opper, it
can be seen that even if an OfL potential is not annealed, the imposition of minimal
information loss will anneal the OnL learning rate.

The case of the single layer perceptron with multiplicative noise, a nonsmooth
model, is interesting and we discuss it a little further. The OfL potential
that implements the Bayes bound for generalization of Opper and Haussler has been
determined. If this potential is used in equations (18) and (19), the Bayes OnL
algorithm found by Solla and Winther is reobtained. They however could not
claim that their algorithm was the gaussian approximation to the OfL Bayes because
Opper's derivation (as theirs) is only valid for smooth models. However, it is quite
tempting to study the resulting equations for nonsmooth models. From the point of
view of designing learning algorithms this is certainly acceptable. We have shown
that they can actually claim that the resulting algorithm, which they called Bayes
OnL, is the gaussian approximation to an algorithm which indeed saturates the Bayes
OfL limit. For this model, the off-diagonal terms of the covariance tend to be smaller
than the diagonal by a factor of $\sqrt{N}$. Asymptotically the covariance tends to become
diagonal and the asymptotic performance, as measured by the generalization error,
for $N \to \infty$ is the same as that of the variational optimized algorithm.

To understand how the annealing is working, we analyze a smooth potential V 
that is flat for large absolute values of the stability. For negative values it saturates 
at a positive value, while for positive stabilities it goes to zero. In the transition 
region it decays monotonically. This kind of potential is quite sensible, actually the 
optimal one we discussed above is of this type. The second derivative that enters the 
annealing equation is positive if the example is correctly classified, and negative if 
not. This means that the system is estimating on-line its performance. If in error, it 
reacts by increasing the estimate of the variance of the posterior distribution and in 
that manner, allowing larger corrections to be made to the current estimate $w$. When
an example is correctly classified, then the system will start making smaller weight
estimate adjustments. Actually this is consistent with the idea
that adaptive annealing schemes should depend on the estimate of the generalization
error.

From an argument similar to Opper [17], the covariance annealing is governed
by

$$ \lim_{\mu \to \infty} \frac{\hat{C}^{-1}(\mu)}{\mu} = J_V(w^*), \tag{20} $$

where the matrix $(J_V(w^*))_{ij} = \overline{\partial_i \partial_j \mathcal{E}_x(t)}$, and the overbar indicates average over
the examples distribution. This is not in general Fisher's Information matrix, but
it is expected to be so for some cases. These include the additive noise case for the
perceptron with the optimal potential, the unsupervised learning case and the
linear perceptron, where the OnL performance is asymptotically efficient. It is
expected to differ in cases such as the perceptron learning from a spherical distribution
of examples in the presence of multiplicative noise, since then OnL can achieve only
twice the error of the Bayes algorithm. It is possible that further studies of this system
of equations can shed light on this exact factor of 2.

Following Solla and Winther, we have not resisted the temptation to apply our
algorithms to potentials which do not satisfy the conditions of smoothness. In
particular an interesting case is the Perceptron algorithm of Rosenblatt applied to
a perceptron in a noiseless student-teacher scenario. The OfL potential can be defined
by $V_R(\lambda) = -\lambda \Theta(-\lambda)$, where $\lambda = \sigma w \cdot S/\sqrt{N}$. A possible prescription for the weights
can be obtained by simulated annealing. The interest resides in the fact that the
generalization error decays as $\alpha^{-1}$ OfL but only as $\alpha^{-1/3}$ OnL. The relevant quantity is
the effective OnL energy $\mathcal{E}_x(t)$. The modulation function, $-\partial_t \mathcal{E}_x(t)$, is

$$ \lim_{\beta \to \infty} \partial_t \ln \int d\lambda\, \exp -\Big[ \beta V_R(\lambda) + \frac{(\lambda - t)^2}{2x} \Big] = \frac{1}{\sqrt{2\pi x}}\, \frac{e^{-t^2/2x}}{H(-t/\sqrt{x})}, \tag{21} $$

where $x = S_i C_{ij} S_j/N$ and $H(y) = \int_y^\infty \exp(-z^2/2)\, dz/\sqrt{2\pi}$. This is surprisingly close
to the optimal OnL modulation function. Even the annealing, which affects $x$, is similar,
and from the update equations for the weights and the covariance the OnL generalization
error decays as $\alpha^{-1}$.
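
The modulation function of eq. (21) is easy to evaluate numerically. The sketch below assumes that its right-hand side has the form $e^{-t^2/2x} / \big(\sqrt{2\pi x}\, H(-t/\sqrt{x})\big)$, with $H$ the gaussian tail integral, which is what follows when $\exp(-\beta V_R)$ becomes a hard wall at $\lambda = 0$ for $\beta \to \infty$. The function names are hypothetical, and `H` is written in terms of the standard library's complementary error function.

```python
import math

def H(y):
    """Gaussian tail integral: integral from y to infinity of exp(-z^2/2) dz / sqrt(2 pi)."""
    return 0.5 * math.erfc(y / math.sqrt(2.0))

def modulation(t, x):
    """Assumed form of the modulation function: exp(-t^2/2x) / (sqrt(2 pi x) H(-t/sqrt(x)))."""
    return math.exp(-t * t / (2.0 * x)) / (math.sqrt(2.0 * math.pi * x) * H(-t / math.sqrt(x)))

if __name__ == "__main__":
    # Tabulate the modulation for a few prior stabilities t at fixed x = 1.
    for t in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(t, modulation(t, 1.0))
```

Under this form the modulation is large for badly misclassified examples (it grows roughly like $-t/x$ for very negative $t$) and vanishes quickly for large positive $t$, so the step size is automatically tied to how wrong the current estimate is on the example.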

To conclude we have studied the first approximation OnL which is (Kullback-Leibler)
closest to potential learning OfL. Somewhat surprisingly the OnL potential
$\mathcal{E}_x(t)$ is not the same as the OfL $V(\lambda)$. The most striking feature is that they depend
on different quantities. The former on $t$, the stability prior to learning, and it could
not be otherwise for the post presentation stability is unknown. The latter, on the
stability $\lambda$, which will tend, in equilibrium, to the OfL (equilibrium) post presentation
stability. A second feature is expected: the energy consists of a pure energy term
$V$ associated with the new example plus another that reflects the presence of previously
presented examples.

We refer to this as a first approximation since a systematic expansion can be
implemented. The infinite (formal) series shows that OfL equilibrium is attained
by parameter and hyperparameter updates that involve only the effective OnL
potential without making reference to the OfL potential. In connection to this we look
at the question of what it means to learn OfL with a potential that is infinite for
negative stabilities, since gradient descent can then only start if the current estimate
is within Version Space (VS). This is the case of the noiseless perceptron optimal potential
mentioned above. While this issue is not totally closed, a tentative answer starts
by noticing that the effective OnL potential can be used even outside VS. A question
immediately follows, and it might be attacked in the future by the techniques of
dynamical replicas: if the effective OnL potential is used iteratively in learning
from a restricted learning set, what will be the asymptotic time state?